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MODEL MISSPECIFICATION IN PEAKS OVER 
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Ecole Polytechnique Federate de Lausanne 

Classical peaks over threshold analysis is widely used for sta- 
tistical modeling of sample extremes, and can be supplemented by 
a model for the sizes of clusters of exceedances. Under mild condi- 
tions a compound Poisson process model allows the estimation of 
the marginal distribution of threshold exceedances and of the mean 
cluster size, but requires the choice of a threshold and of a run param- 
eter, K, that determines how exceedances are declustered. We extend 
a class of estimators of the reciprocal mean cluster size, known as the 
extremal index, establish consistency and asymptotic normality, and 
use the compound Poisson process to derive misspecification tests of 
model validity and of the choice of run parameter and threshold. Sim- 
ulated examples and real data on temperatures and rainfall illustrate 
the ideas, both for estimating the extremal index in nonstandard 
situations and for assessing the validity of extremal models. 

1. Introduction. When extreme- value statistics are applied to time se- 
ries it is common to proceed as though the data are independent and identi- 
cally distributed, although they may be nonstationary with complex covari- 
ate effects and with rare events generated by several different mechanisms. 
Moreover, models that are mathematically justified only as asymptotic ap- 
proximations may be fitted to data for which these approximations are poor. 
In this paper we suggest diagnostics for failure of these models and illustrate 
their application. 

A standard approach to modeling the upper tail of a distribution is the 
so-called peaks over threshold procedure [Davison and Smith (1990)], under 
which a threshold u is applied to data x\,...,x n from a supposedly station- 
ary time series, leaving N positive exceedances Xj — u. Extrapolation beyond 
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the tail of the data is based on a fit of the generalized Pareto distribution 
[Pickands (1975)] 



to the N exceedances, treated as independent. The parameters in (1) are a 
scale parameter a > and a shape parameter that controls the weight 

of the tail of the distribution, whose rth moment exists only if r£ < 1. In 
many applications £ appears to lie in the interval (—0.5, 0.5), but uncertainty 
about its value generally leads to alarmingly wide confidence intervals for 
quantities of interest such as a return period T or a return level xt] these 
satisfy pr(AT > xt) = 1/T. If the time series is white noise, n is large and 
N/n is small, then exceedances of u appear as a Poisson process, and under 
mild conditions we may use the tail approximation pr(X > x) ~ (N/n){l — 
H(x — u)}, where H is the estimate of ( 1 ) . A crucial preliminary to using such 
methods is the choice of threshold u, which is usually performed graphically 
using stability properties of (1): if Y ~ H and u > 0, then conditional on 
Y > u the exceedance Y — u has distribution (1) with parameters £ and a' = 
a + ^u; and if £ < 1, then the mean residual life E(Y — u \ Y > u) = a'/ (1 — £). 
It is standard practice to plot empirical versions of these quantities for a 
range of potential thresholds, and to use only values of u above which the 
estimates appear to be stable and the empirical mean residual life appears 
to be linear. See Coles [(2001), Chapter 4] or Beirlant et al. (2004) for more 
details, and Beirlant, Vynckier and Teugels (1996) and Sousa and Michailidis 
(2004) for variants of the last plot intended to stabilize it in the presence of 
heavy-tailed data. 

This approach to tail modeling is based on a general and well-developed 
probabilistic theory of extremes [Leadbetter, Lindgren and Rootzen (1983), 
Embrechts, Kliippelberg and Mikosch (1997), Falk, Hiisler and Reiss (2004)] 
and is widely used: in 2008 alone the Web of Science records around 450 
articles in which the terms "peaks over threshold" or "generalised Pareto 
distribution" appear in the abstract or title. Thus, it is important to develop 
simple tools for diagnosis of the failure of such models. 

One source of failure is the choice of threshold. A bad choice may yield 
a poor tail approximation, both because the generalized Pareto distribution 
is inappropriate if u is too small and because independence assumptions 
used to fit the model are invalid: in practice, the observations, and therefore 
the exceedances, are almost always dependent. This dependence is often 
reduced by declustering the exceedances, for example, declaring that those 
lying closer together than a run parameter K belong to the same cluster, 
and fitting (1) only to the largest exceedance of each cluster. However, a 
poor choice of K will give a poor inference, so it is essential to check how 
the results depend on the choices of threshold u and run parameter K. 



(1) 




C = o 
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A key issue is thus the effect of possible model misspecification on infer- 
ence. White (1994) gives conditions under which the maximum likelihood 
estimator derived from a misspecified model is a consistent and normally 
distributed estimator of the parameter that minimizes the Kullback-Leibler 
discrepancy between the true and the assumed models, and constructs tests 
for misspecification. In the context of statistics of extremes, the peaks over 
the threshold model may be justified by a compound Poisson process model 
for the exceedances of a random process above a high threshold [Hsing 
(1987), Hsing, Hiisler and Leadbetter (1988)]. This model, which we outline 
in Section 2.1, can be checked through the projection of the two-dimensional 
limiting point process of exceedances onto the time axis: if the projection is 
misspecified, we should be wary about using peaks over thresholds. 

The main contribution of this paper is to construct diagnostics for the ade- 
quacy of peaks over threshold models. Section 2 introduces inter-exceedance 
times truncated by the run parameter K, which we call K-gaps, and dis- 
cusses the selection of an appropriate run parameter and threshold. We pro- 
pose the use of an information sandwich as a diagnostic for model failure, 
and, as a byproduct, we extend the maximum likelihood estimator of the 
reciprocal mean cluster size, the extremal index, given in Siiveges (2007). 
Section 3 uses data simulated from two autoregressive models and a Markov 
chain model to illustrate the application of our ideas. Section 4 applies them 
to real data, to elucidate nonstationarity and tuning parameter selection and 
to aid extremal index estimation when the basic assumptions of extreme- 
value theory are violated. Section 5 contains a brief discussion. 

2. Theory. 

2.1. Likelihood. We consider asymptotic models for the upper extremes 
of a strictly stationary random sequence X\, . . . ,X n with marginal distri- 
bution function F. A standard approach is to consider the limiting point 
process of rescaled variables M n = Y17=l^i/n,(X -b n )/a n as n ~ * 00 1 where 
the sequences {b n } C R and {a n } C R+ are chosen so that the maximum 
a~ 1 (max{Xj} — b n ) has a nondegenerate limiting distribution G [Resnick 
(1987)]. Under mild conditions, if N n converges in distribution to a point 
process M as n— > oo, this must have the representation [Hsing (1987)] 

oo Ah 
i=l j=l 

where (Si,Xn) are the points of a nonhomogeneous Poisson point process 
with mean measure | - 1 x r(-) on [0, 1) x (xl,xr], | • | is the Lebesgue measure, 
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xl and xr are the left and right endpoints of G and t(x,oo] = — logG(x). 
The Xij are such that, for all i, the variables 

_ -lo g G(X lJ ) 

are the points of a point process ji on [l,oo) with an atom at unity. The 
7i are independent of the nonhomogeneous Poisson process (Si,Xn) and of 
each other, and are identically distributed. Thus, the point process limit of 
N n is a compound Poisson process N comprising independent identically 
distributed clusters, the ith of which has Mj exceedances Yn, . . . ,YiM i that 
may have different sizes but occur simultaneously. 

This result implies convergence in distribution of maxima to the gen- 
eralized extreme-value distribution, convergence of threshold exceedances 
to the generalized Pareto distribution, and convergence of the projection 
M* = Y^i=i $i/n t° a compound Poisson process with points at Si and marks 
Mj. In the limit the inter-exceedance times follow a mixture of an expo- 
nential distribution and a point mass at zero [Ferro and Segers (2003)], and 
this remains true for inter-exceedance times truncated by some fixed positive 
value; see below. For a sequence of thresholds u n , define the inter-exceedance 
times in the sequence {Xi} by 

T(u n ) = min{k > 1 : X k+1 > u n \Xi > u n }, 

and the corresponding if -gaps by 

(u n ) = max{T(n n ) — if, 0}, if = 0,1, 

Then Theorem 1 of Ferro and Segers (2003) can be modified to yield a 
limiting distribution for the if -gaps, which Siiveges (2007) gave for K = 1. 
The proof requires only small modifications of the original, by considering 
pr{F(u n )[T(u n ) — K] > t}, where F(u n ) = 1 — F(u n ). Let Tij(u n ) denote the 
(T-field generated by the events X r < u n ,r = i, . . . ,j. For any A G Ji^^) 
with pr(^4) > 0, B G Fk+i,n{u n ) and k, I integers such that k = 1, . . . , n — I, 
define 

a*(n, I) = max sup | pr(£> | „4) — pr(£>)|. 

k A,B 

Then we have the following result. 

Theorem 2.1. Suppose there exist sequences of integers {r n } and of 
thresholds {u n } such that as n — >■ oo, we have r n — > oo, r n F(u n ) — > r and 
pr{A/ rn < u n } — > e~ dT for some r G (0,oo) and 6 G (0, 1]. Moreover, assume 
that there exists a sequence l n = o(n) for which a*(cr n ,l n ) — > as n — > oo 
for all c > 0. Then as n — )■ oo, 

(2) pr{F(u n )S ( - K \u n )>t}^9exp(-et), t>0, 
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where the extremal index 9 lies in the interval (0, 1] and is the reciprocal of 
the mean cluster size, that is, E(Mj) = 9~ 1 . 

Equation (2) corresponds to a limiting mixture model for the intervals 
between successive exceedances: with probability 9 the interval is an ex- 
ponential variable with rate 9, and otherwise it is of length zero, yielding 
a compound Poisson process of exceedance times and a likelihood-based 
estimator of 9. Suppose that N observations from a stationary random se- 
quence X\, . . . ,X n exceed the threshold u n , let the indices {ji~.Xj i > u n } 
denote the locations of the exceedances, let Tj = — ji denote the inter- 
exceedance times, and let £} = max(Tj — K, 0) denote the ith if-gap, 

for i = 1, . . . ,N — 1 and K = 0, 1, Assuming independence of the gaps 

•Sjv^i) the limiting distribution (2) leads to the log likelihood 

N-i 

(3) t K (9- Sf ] ) = (N - 1 - N c ) log(l - 9) + 2N C \og9 - 9^ F{u n )s\ K) 

i=l 

for 9, where Nc = ^i^ 1 1{S^ ^ 0), and to a closed-form maximum likeli- 
hood estimator 9 n , which is the smaller root of a quadratic equation. Below 
we modify the log likelihood (3) to allow nonstationarity to be detected by 
using smoothing [Fan and Gijbels (1996), Siiveges (2007)]. 

2.2. Model misspecification. The point process approach can fail because 
the assumptions of strict stationarity and independence at extreme levels 
are violated, but even if they are fulfilled, the chosen threshold parame- 
ter u n and the run parameter K may be inappropriately small, thereby 
leading to a poor extreme-value approximation or to dependent threshold 
exceedances. In order to detect such difficulties, we turn to classical work on 
model misspecification [White (1982)]. Under broad assumptions, the max- 
imum likelihood estimator derived from a misspecified likelihood £(0) exists 
as a local maximum of £(6). When the true model £q is not contained in 
the postulated model family, that is, there is no 6>o such that £q = £{9q), this 
estimator is consistent for that parameter value 9* within the misspecified 
family £(9) that minimizes the Kullback-Leibler discrepancy with the true 
distribution. Define J{9) = E {£'{9, Sj) 2 }, 1(9) = -E {£"(9, Sj)}, where the 
prime denotes differentiation with respect to 9, and Eo means expectation 
under the true model, and their empirical counterparts 

N-l N-l 

J n (9) = (N- l)- 1 ^ £'(9, S 3 f, I n (9) = -(N - I)" 1 ^ t»(0, Sj). 

3=1 0=1 
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Under regularity conditions satisfied by the limiting distribution (2) when < 
9 < 1, Theorem 3.2 of White (1982) yields that, asn^oo, 

Vn(O n - #*) A N{0, 1(9*y 2 J(0*)}, l n 0n)~ 2 Jn0n) ^ /(0*)~ 2 J(^), 

where —> and denote weak and almost sure convergence, respectively. 
Thus, the estimator derived from (3) using an arbitrary run parameter K 
is consistent for the value 9* minimizing the Kullback-Leibler divergence 
with the true distribution, and is asymptotically normally distributed with 
the sandwich variance I{9 if )~ 1 J{9*)I{9*)~ 1 , which can be estimated by its 
empirical counterpart evaluated at 9 n . 

It is straightforward to verify that the above theory applies to the log 
likelihood (3), so that as u n increases in such a way that N — > oo, the cor- 
responding maximum likelihood estimator is consistent and asymptotically 
normal. 

2.3. Diagnostics: the information matrix test. Tests to detect model mis- 
specification may be based on the fact that the Fisher information for a 
well-specified regular model equals the variance of the score statistic, that 
is, J(9)=I(9). If we write [White (1982)] 

(4) D(9) = J(9)-I(9), 

then one possible misspecification test amounts to testing the null hypothesis 
Ho : D{9) = against the alternative H\ : D{9) ^ 0. Let d(sj,9) denote the 
one-observation version of D(9), let D n {9) = n~ ls ^™ =l d{sj,9) denote the 
empirical counterpart of D(9), and let 

V(9) = E{[d(S J ,9) + D\9)I(9r 1 £'(9,S,)] 2 } 

and V n {9) denote the asymptotic variance of D(9) and its empirical coun- 
terpart. Detailed formulae are given in the Appendix. Under mild regularity 
conditions, White (1982) proves the following theorem, here given for a scalar 
parameter. 

Theorem 2.2. If the assumed model £(Sj,9) contains the true model for 
some 9 = 9 , then as n -+ oo, ^h~D n {9 n ) ^ N{0, V(9 )), V n {9 n ) ^ V(9 ), 
and the test statistic T(9 n ) = nD n (9 n ) 2 V n {9 n )~ l is distributed as xf- 

To check the quality of this chi-squared approximation, we performed 
simulations from the AR(1) and AR(2) processes described in Section 3, 
using choices of threshold and run parameter under which the models are 
well specified. Probability plots of the simulated T(9 n ) showed that the \\ 
approximation is good for N > 80, and tends to be conservative if N < 80. 
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AR(1) AR(2) Markov chain 




Extremal index estimate 




Fig. 1. Illustration of the diagnostics, based on data simulated from an AR(1) model (left 
column), an AR(2) model (middle column) and a Markov chain (right column). The top 
row gives an impression of each series, together with its 0. 95 and 0. 99 quantiles ( dashed 
and dotted horizontal lines, respectively). The second row shows the information matrix 
test T(9) (gray surface) and its 5% critical value x?(0.95) =3.84 (thick dashed black line 
around the box), with the values above 3.84 accentuated by black blobs. The third row shows 
the estimated extremal index 6 (black surface ) as a function of the run parameter K and 
threshold u, with the true value of 6 (thick dash-dotted black line). 



Thus, relying on this approximation can lead to a loss of power when the 
number of exceedances is small, in which case it is difficult to detect mis- 
specification anyway. Below we shall use the chi-squared quantiles without 
further comment. 

3. Simulated examples. For a numerical assessment of the ideas in Sec- 
tion 2, we apply them to three processes: 

AR(1): Yi = 4>Yi^\ + Zi with (p = 0.7 and standard Cauchy, with K = 1 
and 6 = 0.3; 

AR(2): Yi = <f>iYi-i + + Z u with fa = 0.95, <p 2 = -0.89 and Z< 

Pareto with tail index 2, with K = 5 and 9 = 0.25; 

Markov chain: With Gumbel margins, a symmetric logistic bivariate dis- 
tribution for consecutive variables and dependence parameter r = 2 [Smith 
(1992)], with (9 = 0.33 and K unknown. 
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AR(1) AR(2) Markov chain 




0.95 0.96 0.97 0.98 0.99 0.95 0.96 0.97 0.98 0.99 0.95 0.96 0.97 0.98 0.99 



F(u) F(u) F(u) 

Fig. 2. Root mean squared error (top row) and relative bias (bottom row) of the K-gaps 
maximum likelihood (solid), the iterative least squares [Siiveges (2007), dashed] and the 
intervals (dotted) estimators on the AR(1) process (left panels), on the AR(2) process 
(middle panels) and on the Markov chain (right panels). K = 1 for the AR(1) process, 
K = 6 for the AR(2) process, and K — 5 for the Markov chain. The number of observations 
was n = 30,000. 

For each process we generated series of length n = 8000 and obtained se- 
quences of inter-exceedance times; the top row of Figure 1 shows a short 
sample with a typical extreme cluster from each process. We then calculated 
the maximum likelihood estimates 9 for K = 1, ... ,12 and thresholds corre- 
sponding to the 0.95,0.955, . . . ,0.995 quantiles. The second row of Figure 1 
shows the resulting surfaces for the information matrix statistic T(8). The 
lowest panels show the estimated extremal index. For the AR(1) process, 
the information matrix test suggests misspecification for the combination 
of low thresholds with small run parameter K, but this disappears when u 
or K is increased. For the AR(2) process, the information matrix test in- 
dicates clear misspecification for most thresholds when K < 5, and there is 
then also substantial overestimation of the extremal index. The information 
matrix test for the Markov chain suggests that although well-specifiedness 
cannot be rejected for K = 1 and 2, inference with a larger run parameter 
will be more reliable. Correspondingly, the extremal index estimate is closer 
to the true value for larger run parameters. 

To assess the quality of the extremal index estimator based on (3), a simu- 
lation study was performed with 1000 replications of each of these processes, 
using K = 1 for the AR(1) process, K = 6 for the AR(2), and K = 5 for the 
Markov chain as suggested by the misspecification tests. We simulated pro- 
cesses of lengths n = 2000 and n = 30,000, and used thresholds corresponding 
to the 0.95, 0.96, 0.97, 0.98 and 0.99 quantiles. The median relative bias and 
the root mean squared error for the case n = 30,000 are shown in Figure 2. 
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Information Matrix Test 




Extremal index estimate 




Fig. 3. The behavior of the information matrix test as a function of the dependence 
range of the process. The top row shows three ARIMA(1, 0, d) sequences, with d~0 (left 
column), d = 0.3 (middle column) and d = 0.49 (right column), and the second row their 
autocorrelation functions. The third row contains the information matrix test T(S) (gray 
surface), with the thick dashed black lines indicating the critical value for the information 
matrix test Xi(0.95) = 3.84, and the black blobs indicating T{8) > 3.84. The fourth row 
presents the extremal index estimate, with the thick dash-dotted lines representing the 
theoretical value 8=1 for the case d = 0. 



The plots confirm that if a suitable run parameter is chosen, then the max- 
imum likelihood estimator has generally lower bias and root mean squared 
error than the most commonly used competitor, the intervals estimator, and 
another good estimator based on an iterative weighted least squares fit to 
the longest inter-exceedance times [Siiveges (2007)]. 
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In order to explore the behavior of the misspecification tests in the case of 
long-range dependence, we simulated fractionally differenced ARIMA(1, 0, d) 
processes of length n = 8000, with Gaussian white noise innovations, au- 
toregressive parameter 0.5 and difference parameter d = 0,0.3 and 0.49. The 
sequence with d = corresponds to a stationary normal AR(1) process, and 
therefore has extremal index 6 = 1. For the other cases, no theoretical cal- 
culations are known to us concerning the existence of the extremal index. 
The autocorrelation functions of the data, shown in the second row of Fig- 
ure 3, show long memory when d > 0. The test statistic, plotted in the third 
row of Figure 3, shows a lengthening dependence range: misspecification is 
indicated at thresholds up to u = F~ l (0.985) and K < 8 for d = 0.3, and 
at all thresholds and K < 8 for d = 0.49. The absence of misspecification at 
very high thresholds for d = 0.3 may be due to the effect of increasing the 
threshold while keeping the length of the series fixed. 

4. Data examples. 

4.1. Neuchdtel daily minimum summer temperatures. For a first applica- 
tion to real data we use daily minimum summer temperatures from Neuchatel 
from 1 January 1901 to 31 May 2006. Climatic change raises the question 
whether changes in temperature extremes can be summarized simply by a 
smooth variation of the mean and the variance of the entire temperature 
distribution, or whether there are additional changes in the extremes. The 
daily summer minimum temperatures at Neuchatel show a strong trend in 
the averages, and we investigate whether this is accompanied by a change in 
the clustering of the extremes. The data have been carefully homogenized, 
so such changes should not be due to changes in instrument siting or type, 
or urban effects. 




i i i i i i i i i r 

1910 1930 1950 1970 1990 



Fig. 4. The deseasonalized Neuchdtel daily minimum temperatures (black) with their 
trend, estimated by a cubic spline- smoothed 10-year moving median (heavy white line). 
The horizontal line is intended to help appreciate the trend. 
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Fig. 5. Classical threshold selection plots for three Jfl-year windows centered on 1925, 
1955 and 1985 for the Neuchdtel daily minimum temperature anomalies, showing the pa- 
rameter estimate (bold) and pomtwise 95% confidence limits (solid) as functions of the 
threshold. The dashed lines show the average estimates for the different thresholds. 

We stationarized the raw data by first centering and scaling by the annual 
median and median absolute deviation (MAD) cycle, and then de-trending 
by using a ten- year moving median and MAD. Below we treat the result- 
ing standardized temperature anomalies for the months June-August in 
successive years as a continuous time series. Figure 4, which shows the de- 
seasonalized series before de-trending, shows a strong irregular variation in 
the mean. The presence of trend also motivates a careful misspecification 
analysis, not only for appropriate estimation of the extremal index, but for 
the assumption of stationarity. 

Initially assuming stationarity of the anomalies in the 1901-2006 period, 
we applied the threshold selection procedures mentioned in Section 1 and 
described in more detail by Coles [(2001), Section 4.3] to the entire sequence. 
There seems to be stability above the 0.98 quantile, and generalized Pareto 
analysis of the complete sequence showed acceptable diagnostics. However, 
Figure 5, which shows these plots when applied separately to three 41-year- 
long periods centered on the years 1925, 1955 and 1985, casts some doubt 
on the overall results. 

We therefore checked model misspecification as a function of time, by 
centering 41-year long windows successively at 15 July of each year, and 
calculating the information matrix test defined by equation (4), for every 
combination of threshold u and run parameter K. The calculations thus 
gave 106 sets of T{6) values, with extremal index estimates for all (u, K) 
pairs for the sequence of anomalies. 
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1906 1915 1924 




Fig. 6. Misspecification as a function of time for the Neuchatel summer daily minimum 
temperature anomalies. Horizontal foreground axis: threshold u as F(u); horizontal left 
axis: run parameter K; vertical axis: T(0). The thick dashed lines around the box cor- 
respond to the critical 0.95-quantile of the Xi distribution, the black blobs emphasize the 
parameter combinations where T{&) > xi(0.95). Years are indicated above the plots. 
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Fig. 7. Comparison of the maximum likelihood estimator using K = 4 and F(u) — 0.99 
(heavy solid in both panels) to the intervals (top panel, thick dashed) and to the itera- 
tive weighted least squares (bottom panel, thick dotted) estimators. The 95% confidence 
intervals are shown by thin versions of the lines. 



The three main potential sources of misspecification here are threshold se- 
lection, the choice of run parameter and possible nonstationarity, so the test 
statistics proposed above must depend on these. Figure 6, which presents the 
surfaces of T{9) for twelve different years, suggests misspecification in the 
period 1935-1970 for thresholds around the 0.97- and 0.98-quantile for all 
run parameters. Smaller instabilities were also found, mostly between 1985- 
2000, though these rarely exceeded the critical x^-quantile. These two peri- 
ods roughly coincide with the strongest nonstationarity in the summer mean 
temperatures: see Figure 4, in which the most marked periods of change in 
the 10-year median are in the 1940s and in the 1980s. The threshold sug- 
gested by classical selection methods, the 0.98-quantile, should be avoided: 
the ridge indicating misspecification is located at this threshold. 

Although our motivation for testing is different from the usual one leading 
to the definition of the false discovery rate (FDR), the multiple testing setup 
and the dependence between information matrix tests applied successively 
in sliding windows should be taken into account when we want to justify 
the existence of a misspecification region. Our main interest does not lie 
in the discovery of regions where the null hypothesis of well-specifiedness 
is rejected at a given level of FDR [Benjamini and Hochberg (1995)], but 
in finding those where it cannot be rejected. Nevertheless, we tested the 
significance of the departure from the null hypothesis by the procedure pro- 
posed in Benjamini and Yekutieli (2001) separately for each u, K pair, and 
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found that the misspecification is significant for all K and for F(u) =0.98 
between roughly 1940-1965. The information matrix test was closest to zero 
at the point F{u) = 0.99, K = 4 over the whole century, so we chose these for 
smooth estimation of the extremal index. Figure 7 shows the resulting locally 
constant weighted K-gaps estimates, compared to intervals and to iterative 
weighted least squares estimates based on 41-year long sliding windows. The 
confidence intervals for the -ff-gaps estimator are based on asymptotic nor- 
mality. Nonparametric bootstrap intervals were also calculated, but showed 
only slight differences mostly in the middle of the century, where the boot- 
strap interval was slightly wider. The value of dips in the 1950s and in 
recent years, but overall any evidence for changes in the clustering of summer 
minimum daily temperatures seems to be weak. 

The information matrix test suggests the existence of fluctuations in the 
time point process of extreme anomalies. Using the ten-year moving window 
to de-trend the series, and a 41-year window for the information matrix tests 
and the estimation of the extremal index, only the combination of a high 
threshold and a relatively high run parameter seem to yield a well-specified 
model. The period where the models are misspecified roughly coincides with 
a local peak in the 10-year moving median of the data set. Changes in the 
median temperatures may be accompanied by changes in clustering char- 
acteristics, but perhaps using a 10-year moving window for de-trending is 
insufficient to remove mean fluctuations, so the anomalies display traces 
of residual nonstationarity that then appear in the 41-year moving win- 
dows. Our investigation thus emphasizes the importance of an appropriate 
treatment of long-term trends. Many studies of climate extremes use vary- 
ing thresholds based on local quantile estimation or on an assumption of 
a trend of simple parametric form, the first of which corresponds to our 
ad hoc selection of window length for de-trending; see, for example, Kharin 
and Zwiers (2005), Nogaj et al. (2006) or Brown, Caesar and Ferro (2008). 
Our results indicate that model quality is highly sensitive to such choices, 
so it is necessary to check whether the models are well specified, in order to 
avoid biased estimates with underestimated variances. Climatological stud- 
ies commonly directly compare periods of a few decades, which are assumed 
stationary, but this too should be performed with care. In our study, the 
time-scale of the fluctuations found at extreme levels is shorter than a few 
decades on thresholds u < .F -1 (0.99) in the period 1940-1965. As this is 
a period where the global mean temperature based on observational data 
has a marked local peak, this might arise at other stations also, and other 
climate variables may also show instability on such time-scales. 

4.2. Daily rainfall in Venezuela. The rainfall data recorded daily be- 
tween 1 January 1961 and 31 December 1999 at Maiquetia airport in Venezuela 
provide a striking example of the difficulties of using simple extreme-value 
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methods for risk estimation. Prior to December 1999, the annual maxima of 
the data set were fitted with a Gumbel model, with no diagnostics indicat- 
ing a bad fit. However, after an unusually wet fortnight in December 1999, 
extensive destruction and around 30,000 deaths [Larsen et al. (2001)] were 
caused by three consecutive daily precipitation totals of 120, 410.4 and 290 
mm, the largest of which was almost three times greater than the maximum 
of the preceding 40 years. The return period for the peak value of 410.4 mm 
under simple models can be expressed in thousands or even millions of years. 
Why do classical extreme-value methods fail so catastrophically, and could 
more sophisticated methods have given a different return period estimate 
for such an event? 

Coles and Pericchi (2003) apply a Bayesian approach to the point pro- 
cess representation [Smith (1989)], which is essentially equivalent to fitting 
the generalized Pareto distribution. They use a threshold corresponding ap- 
proximately to the 0.96-quantile and including all exceedances, and argue 
in favor of partitioning the sequence into two seasons, the one of interest 
being from mid-November to April. With these refinements, they obtain a 
predictive return period of approximately 150 years for 410.4 mm. However, 
the classical threshold selection plots for these months, shown in the left 
three panels of Figure 8, indicate trouble with the model: parameter sta- 
bility is compatible with the confidence intervals only for thresholds much 
higher than the 0.96-quantile. Ignoring the tendency of extremes to cluster 
may also have implications for the estimates and their variance, because the 
independence assumption is violated. Moreover, which estimate should we 
choose if different methods give very different answers? 

Following Coles and Pericchi (2003) and backed by meteorological infor- 
mation, we took only the months December- April, initially excluding De- 
cember 1999, and calculated our diagnostics for u G [F" 1 (0. 95), F" 1 (0.995)] 
and K = 1, ... ,12. The graph of the statistic T(9) in the rightmost panel of 
Figure 8 displays three clear features: a region of misspecification at thresh- 
olds below the 0.96-quantile; a ridge along the 0.98-quantile with relatively 
higher values than the surrounding region, corresponding to the most marked 
instability interval in the classical threshold selection plots; and higher test 
statistic values for K = 1,2, implying misspecification for combinations with 
F(u) < 0.97, K < 2. The best regions appear to be either F(u) « 0.97, 
K = 3 or F(u) ~ 0.99, K = 3, but there is no further indication which is 
preferable, and because of the ridge, the existence of a contiguous area of 
well-specifiedness is doubtful. 

The generalized Pareto model (1) was fitted with F(u) = 0.97, K = 3 
and F(u) = 0.99, K = 3. The resulting parameter estimates and standard 
errors are very different: ^ = 0.27 (0.14), a = 14.8 (2.4) for 0.97 quantile, 
and £ = —0.03 (0.14), a = 26.6 (5.3) for the 0.99 quantile. Corresponding 
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Fig. 8. Classical threshold selection plots for the Venezuelan daily rainfall data for the 
months December- April, between January 1961 and April 1999. The three panels on the 
left show the mean excess plot and the modified scale and the shape parameters of GPD 
fits as a function of threshold. The rightmost panel is the statistic T{9) as a function of 
the run parameter K and the threshold u on the probability scale F(u). 

F(u) = 0.97, K = 3 F(u) = 0.99, K = 3 
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Fig. 9. Diagnostic and return level plots for the generalized Pareto fits to the Venezuelan 
rainfall data. The fits excluding December 1999 are shown in the top row. The left pair 
of plots presents the quantile- quantile and return level plots using the 0.97-quantile as the 
threshold, the right pair those using the 0.99- quantile. The run parameter was K = 3 for 
both. The same plots for model fits including December 1999, with the same thresholds and 
run parameters, are presented in the bottom row. Each panel shows the ordered data (x) 
with a line representing the fitted model and a pomtwise 95% envelope. 

diagnostic plots are shown in the upper row of Figure 9, the left two plots 
referring to the lower threshold, and the right two to the higher. Neither 
model seems poor, but they give very different return periods for the value 
410.4 mm: approximately 600 years for the model with threshold F(u) = 
0.97, and an unreasonable 30 million years for the other. The catastrophe 
seems to be compatible only with the lower threshold model, despite the 
fact that in extreme-value statistics, models above higher thresholds are 
generally considered to be closer to the limiting distribution. Inclusion of 
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December 1999 does not change the misspecification tests much, but changes 
the estimates to £ = 0.5 (0.15), a = 12.6 (2.2) for the lower threshold and 

£ = 0.27 (0.15), a = 25.1 (5.2) for the higher one. The corresponding return 
periods become 65 and 300 years, respectively. A look at the diagnostic plots 
in the bottom row of Figure 9 confirms that the former model admits the 
catastrophe quite smoothly, whereas it remains an outlier in the latter. 

One explanation of this apparent paradox might be that the underlying 
process is a mixture. Rainfall is generated by different atmospheric processes, 
such as convective storms, cold fronts or orographic winds. If so in this case, 
the observed extreme process corresponds to the extremes of a mixture dis- 
tribution Y^iLiPi^i with the component distribution functions Fi appear- 
ing with probabilities pi, where the Fi could have different extreme- value 
limits: with m = 2, for example, F\ might correspond to the short-tailed 
case £ = —1/2 and F2 to the long-tailed case £ = 1. Such a mixture can 
show unstable behavior like that of Figure 8, which hints at the presence 
of at least two components: a more frequent light-tailed one with relatively 
high location and scale parameters, dominating the levels around the 0.99- 
quantile, and a rarer heavy-tailed one concentrated at lower levels and hav- 
ing a smaller scale parameter, but generating extremely large observations 
occasionally. A more sophisticated model for the clustering of extremes also 
suggests a mixture character, but will be reported elsewhere. This failure of 
simple extreme- value techniques is a warning to beware of oversimplification, 
and suggests that an approach linking atmospheric physics and statistical 
methods would provide better risk estimates. 

5. Discussion. Inference about the extremal behavior of a process in- 
volves assumptions such as asymptotic independence at extreme levels and 
stationarity, and also entails the selection of auxiliary quantities such as 
threshold and run parameters in order to apply asymptotic models with 
finite samples. Careful investigation of possible model misspecification is 
therefore essential. 

In this paper we have applied standard methods of detecting misspec- 
ification to the point mass-exponential mixture model (2) for the inter- 
exceedance times. These tests assist in the selection of the threshold and 
the run parameter K and thus help to provide better estimates of both the 
extremal index and of the generalized Pareto distribution. Failure of the 
model (2) indicates failure of a more general limit, and consequently of the 
GPD approximation (1). Analysis of the Venezuelan rainfall data shows that 
misspecification tests can provide a valuable supplement to classical thresh- 
old selection procedures, can lead to improved models and better variance 
estimates, and may yield further insight into the structure of the data. 

We have also described a maximum likelihood estimator for the extremal 
index, based on the point mass-exponential model (2) and on the existence of 
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a selection procedure for K. The maximum likelihood estimator is consistent 
and asymptotically normal under an appropriate choice of K, and shows 
small asymptotic bias and root mean square error compared to the best 
competing estimators. The joint application of the misspecification tests 
and the smoothed maximum likelihood estimator proved the good properties 
of the procedure as an efficient method to detect violations of underlying 
assumptions such as nonstationarity or indicate other model problems like 
mixture character that cannot be disregarded using finite thresholds. It can 
be therefore a useful aid to fine-tuning parameters of extreme- value models 
or investigating their limitations. 

One natural question is whether the assessment of misspecification for 
extremal models might better be based on (1). The difficulty with this is 
that the rth moment of the score statistic for £ exists only if r£ > — 1, so 
the maximum likelihood estimators of £ and a are regular only if £ > — 1/2 
[Smith (1985)] and the observed information has finite variance only if £ > 
—1/4, and information quantities for (1) have poor finite-sample properties. 
The distribution (2), on the other hand, is regular for < 8 < 1, and so has 
no such disadvantages. 

APPENDIX: FORMULAE FOR THE INFORMATION MATRIX TEST 

Assume that X±, . . . , X n satisfy the necessary conditions for Theorem 2.1. 
For a threshold u n , suppose that N <n observations exceed the threshold 
u n . Let the indices {ji : Xj i > u n } denote the times of the exceedances, and 

let c$ = F(u n )s\ K ^ = F(u n ) max(jj + i — ji — K, 0) denote the ith observed 
K-g&p normalized by the tail probability F(u n ) for i = 1, . . . , N — 1 
and K = 0,1, ... . Then, denoting derivatives with respect to 9 by a prime, 
it follows from the likelihood (3) that 



4(M 
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/(c^ =0) 2J(cT>0) (K) 
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where 1(A) is the indicator function of the set A. Then we can derive the 
one-observation version, the sample mean of the difference D between the 
variance of the score and the Fisher information and the sample variance of 
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the latter as 

d(e . c (K) ) _ 2I(4 K) ><» | IK) 4c f } 

d ^ c * )- 02 + c j ~~e~' 

D n {6) = J n (6) - I n (9), 

JV ~ 1 r ai(c {k) > o) Ar {K) -\ 

j=l 

N-l 

v n {6) = (n — i)" 1 Y.idicf^-D'MerH'^e-c^)}, 

3=1 

and from there, the information matrix test statistic is obtained by substi- 
tuting the appropriate quantities and the estimated value of 6 n K ^ as 

T{e^) = nD n {e^fv n {e^y\ 



Acknowledgments. We gratefully acknowledge helpful comments of Philippe 
Naveau, Jonathan Tawn, two referees, an associate editor and the editor. 

REFERENCES 

Beirlant, J., Goegebeur, Y., Segers, J. and Teugels, J. (2004). Statistics of Ex- 
tremes. Wiley, Chichester. MR2108013 

Beirlant, J., Vynckier, P. and Teugels, J. L. (1996). Excess functions and estimation 
of the extreme-value index. Bernoulli 2 293-318. MR1440271 

Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A prac- 
tical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289-300. 
MR1325392 

Benjamini, Y. and Yekutieli, D. (2001). The control of the False Discovery Rate in 
multiple testing under dependency. Ann. Statist. 29 1165-1188. MR1869245 

Brown, S. J., Caesar, J. and Ferro, C. A. T. (2008). Global changes in extreme daily 
temperature since 1950. J. Geophys. Res. 113 D05115. doi: 10.1029/2006JD008091 . 

Coles, S. G. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer, 
London. MR1932132 

Coles, S. G. and Pericchi, L. (2003). Anticipating catastrophes through extreme-value 
modelling. Appl. Statist. 52 405-416. MR2012566 

Davison, A. C. and Smith, R. L. (1990). Models for exceedances over high thresholds 
(with discussion). J. Roy. Statist. Soc. Ser. B 52 393-442. MR1086795 

Embrechts, P., Kluppelberg, C. and MlKOSCH, T. (1997). Modeling Extremal Events 
for Insurance and Finance. Springer, Berlin. MR1458613 

Falk, M., Husler, J. and Reiss, D. (2004). Laws of Small Numbers: Extremes and Rare 
Events. Birkhauser, Basel. MR2104478 

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chap- 
man & Hall, London. MR1383587 



20 



M. SUVEGES AND A. C. DAVISON 



Ferro, C. A. T. and Segers, J. (2003). Inference for clusters of extreme values. J. Roy. 

Statist. Soc. Ser. B 65 545-556. MR1983763 
Hsing, T. (1987). On the characterization of certain point processes. Stochastic Process. 

Appl. 26 297-316. MR0923111 
Hsing, T., Husler, J. and Leadbetter, M. R. (1988). On the exceedance point process 

for a stationary sequence. Probab. Theory Related Fields 78 97-112. MR0940870 
Kharin, V. V. and Zwiers, F. W. (2005). Estimating extremes in transient climate 

change simulations. Journal of Climate 18 1156-1173. 
Larsen, M. C, Wieczorek, G. F., Eaton, L. S., Morgan, B. A. and Torres- 
Sierra, H. (2001). Venezuelan debris flow and flash flood disaster of 1999 studied. 

EOS, Transactions of the American Geophysical Union 47 572. 
Leadbetter, M. R., Lindgren, G. and Rootzen, H. (1983). Extremes and Related 

Properties of Random Sequences and Processes. Springer, New York. MR0691492 
Nogaj, M., Yiou, P., Parey, S., Malek, F. and Naveau, P. (2006). Amplitude and 

frequency of temperature extremes over the North Atlantic region. Geophys. Res. Lett. 

33 L10801. doi: 10.1029/2003GL019019. 
Pickands, J. (1975). Statistical inference using extreme order statistics. Ann. Statist. 3 

119-131. MR0423667 

Resnick, S. I. (1987). Extreme Values, Regular Variation and Point Processes. Springer, 

New York. MR0900810 
Smith, R. L. (1985). Maximum likelihood estimation in a class of non-regular cases. 

Biometrika 72 67-90. MR0790201 
Smith, R. L. (1989). Extreme value analysis of environmental time series: An application 

to trend detection in ground-level ozone. Statist. Sci. 4 367-377. MR1041763 
Smith, R. L. (1992). The extremal index for a Markov chain. J. Appl. Probab. 29 37-45. 

MR1 147765 

SOUSA, B. and Michailidis, G. (2004). A diagnostic plot for estimating the tail index of 
a distribution. J. Comput. Graph. Statist. 13 974-995. MR2109061 

SiivEGES, M. (2007). Likelihood estimation of the extremal index. Extremes 10 41-55. 
MR2407640 

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 
50 1-25. MR0640163 

White, H. (1994). Estimation, Inference and Specification Analysis. Cambridge Univ. 
Press, Cambridge. MR1292251 

ecole polytechnique federale 

de Lausanne 
EPFL-FSB-IMA-STAT 
Station 8, 1015 Lausanne 
Switzerland 

E-MAIL: Maria.Suveges@epfl.ch 

Anthony.Davison@epfl.ch 



